Downloading SRA reads from archives
Sources of reads
Microbiome read sequencing data may be obtained from different sources. The most common ones include:
- Reads obtained directly from a sequencing platform by investigators.
- Reads downloaded from the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
- Reads synthesized using sequencing simulators.
Snakemake workflow for downloading SRA reads
This tentative snakemake workflow defines rules for downloading fastq sequences from SRA; the rules are organized as a DAG (directed acyclic graph). A detailed interactive snakemake report is available here.
Installing SRA Toolkit
- Navigate to where you want to install the tools, preferably the home directory.
- For more information click here.
Demo on macOS
curl -LO https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-mac64.tar.gz
tar -xf sratoolkit.3.0.0-mac64.tar.gz
export PATH=$HOME/sratoolkit.3.0.0-mac64/bin/:$PATH
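The same steps should carry over to Linux with a different archive name. A minimal sketch, assuming the 3.0.0 release directory also hosts an ubuntu64 archive that follows the same naming pattern, and that the archive is extracted in the home directory; the helper name sratoolkit_archive is hypothetical:

```shell
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to an SRA Toolkit 3.0.0 archive name.
sratoolkit_archive() {
  case "$1" in
    Darwin) echo "sratoolkit.3.0.0-mac64.tar.gz" ;;
    Linux)  echo "sratoolkit.3.0.0-ubuntu64.tar.gz" ;;  # assumed archive name
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Usage (commented out so that sourcing this file downloads nothing):
# archive=$(sratoolkit_archive "$(uname -s)")
# curl -LO "https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/${archive}"
# tar -xf "${archive}"
# export PATH="$HOME/${archive%.tar.gz}/bin:$PATH"
```

Taking the kernel name as an argument rather than calling uname inside the function keeps the mapping easy to check by hand.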
Create a cache root directory
mkdir -p ~/ncbi
echo '/repository/user/main/public/root = "cache_directory"' > ~/ncbi/user-settings.mkfg
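The value cache_directory above is a placeholder for a real path. A minimal sketch that writes the same setting for a chosen directory; the function name and both arguments are illustrative assumptions, not part of the SRA Toolkit:

```shell
# Hypothetical helper: create the cache directory and write the
# public cache root setting into a settings file.
#   $1 = cache directory
#   $2 = settings file to write
set_sra_cache_root() {
  cache_dir=$1
  settings_file=$2
  mkdir -p "$cache_dir" "$(dirname "$settings_file")"
  printf '/repository/user/main/public/root = "%s"\n' "$cache_dir" > "$settings_file"
}

# Usage (commented out; both paths are examples):
# set_sra_cache_root "$HOME/ncbi/public" /path/to/user-settings.mkfg
```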
Confirm sra toolkit configuration
- The vdb-config -i command below will display a blue-colored dialog.
- Use tab or click c to navigate to the cache tab.
- Review the configuration, then save (s) and exit (x).
vdb-config -i
A screenshot of the SRA configuration.
For more information click here.
Alternative method
We can create an environment and install essential tools in it.
For example, an sradb environment defined in environment.yml:
name: sradb
channels:
- conda-forge
- bioconda
dependencies:
- sra-tools
- entrez-direct
- pysradb
mamba env create --file environment.yml
Downloading multiple fastq files
- Make sure that fasterq-dump is in the path.
- Type which fasterq-dump or fasterq-dump --help to confirm.
- You must specify the output and temporary directories.
- It is possible to specify a range of SRA accessions in a for loop.
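The PATH check in the list above can also be scripted, so that a download loop stops early when a tool is missing. A minimal sketch; the helper name require_tool is hypothetical:

```shell
# Hypothetical helper: succeed only if every named tool is on the PATH.
require_tool() {
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "error: $tool not found in PATH" >&2
      return 1
    fi
  done
}

# Usage (commented out): abort before downloading if the toolkit is missing.
# require_tool fasterq-dump || exit 1
```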
Example code for downloading reads for SRA accessions ranging from SRR7450706 to SRR7450761
for (( i = 706; i <= 761; i++ ))
do
  time fasterq-dump SRR7450$i \
    --split-3 \
    --force \
    --skip-technical \
    --outdir data/reads \
    --temp data/temp \
    --threads 4
done
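The loop above builds accession names from a numeric range. A minimal sketch that separates generating the accession list from downloading, so the list can be inspected or saved first. The function names are hypothetical; the fasterq-dump options are the ones used above:

```shell
# Hypothetical helper: print accessions PREFIX+START .. PREFIX+END, one per line.
accession_range() {
  prefix=$1; start=$2; end=$3
  i=$start
  while [ "$i" -le "$end" ]; do
    echo "${prefix}${i}"
    i=$((i + 1))
  done
}

# Hypothetical helper: read accessions from stdin and download each one.
download_accessions() {
  while read -r acc; do
    [ -n "$acc" ] || continue  # skip blank lines
    # stdin is redirected so the tool cannot consume the accession list
    fasterq-dump "$acc" \
      --split-3 \
      --force \
      --skip-technical \
      --outdir data/reads \
      --temp data/temp \
      --threads 4 < /dev/null || echo "failed: $acc" >&2
  done
}

# Usage (commented out so that sourcing this file downloads nothing):
# accession_range SRR7450 706 761 | download_accessions
```

Reporting a failed accession and continuing, rather than stopping the whole loop, is a design choice for long batches; failed accessions can be retried afterwards.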
Compressing and uncompressing files
The microbiome fastq files are usually very large. Compressing them may save lots of space.
Uncompressing with bash
gunzip data/reads/*.gz
Compressing with bash
gzip data/reads/*.fastq
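Before moving or deleting compressed files it can help to confirm that each archive is intact. A minimal sketch using gzip -t, which tests an archive without extracting it; the function name is hypothetical:

```shell
# Hypothetical helper: compress every .fastq file in a directory,
# then test the integrity of each resulting .gz archive.
compress_and_verify() {
  dir=$1
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue       # no .fastq files matched
    gzip "$f" || return 1         # replaces f with f.gz
    gzip -t "$f.gz" || return 1   # non-zero exit if the archive is corrupt
  done
}

# Usage (commented out):
# compress_and_verify data/reads
```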
Resizing Fastq files
- Sometimes we want to extract a small subset to test the bioinformatics pipeline.
- You can resize the fastq files using the seqkit sample function [1].
Example extracting only 1% of the paired-end metagenomics sequencing data.
This bash script extracts 1% of the reads from four samples (SRR10245277 to SRR10245280).
mkdir -p data
for i in {77..80}
do
  # Process forward (1) and reverse (2) reads with identical options
  for mate in 1 2
  do
    seqkit sample -p 0.01 SRR102452${i}_${mate}.fastq \
      | seqkit shuffle -o data/SRR102452${i}_${mate}_sub.fastq
  done
done
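After subsampling paired-end data it is worth confirming that both mate files still contain the same number of reads. A minimal sketch, assuming uncompressed FASTQ with standard four-line records (no wrapped sequence lines); the function names are hypothetical:

```shell
# Hypothetical helper: count reads in an uncompressed four-line-per-record FASTQ file.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: succeed only if both mate files hold the same number of reads.
check_pairs() {
  [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}

# Usage (commented out; file names follow the script above):
# check_pairs data/SRR10245277_1_sub.fastq data/SRR10245277_2_sub.fastq \
#   || echo "mate files differ in read count" >&2
```

Equal counts are a necessary check only; they do not prove that the reads in the two files are still in matching order.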
References
Appendix
Project main tree
.
├── LICENSE
├── README.md
├── config
│  ├── config.yaml
│  └── samples.tsv
├── dags
│  ├── rulegraph.png
│  └── rulegraph.svg
├── data
│  ├── metadata
│  ├── reads
│  ├── temp
│  └── test
├── docs
│  └── env_spec_file.txt
├── images
│  ├── smkreport
│  ├── sra.png
│  └── sra_config_cache.png
├── index.Rmd
├── library
│  ├── apa.csl
│  ├── imap.bib
│  └── references.bib
├── report.html
├── results
│  ├── project_tree.txt
│  └── run_accessions.txt
├── styles.css
└── workflow
   ├── Snakefile
   ├── envs
   ├── reports
   ├── rules
   ├── schemas
   └── scripts
18 directories, 18 files
Screenshot of interactive snakemake report
The interactive snakemake HTML report can be viewed by opening report.html in any compatible browser. You will be able to explore the workflow and the associated statistics. You can close the left bar to get a wider display.